More Feature Variables, Yet No Improvement in Random Forest Classification

Original 生信宝典 2022-07-05


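Evaluating the final model built during RFE variable selection

The final fitted model is available as rfe$fit and can be used for downstream prediction.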

library(randomForest)
rfe$fit

## 
## Call:
##  randomForest(x = x, y = y, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 14
## 
##         OOB estimate of  error rate: 5.08%
## Confusion matrix:
##       DLBCL FL class.error
## DLBCL    43  1  0.02272727
## FL        2 13  0.13333333

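However, this model has not been tuned. Although it uses more variables, its predictions on the test set are no better than those of the model built on the Boruta-selected feature variables: in the confusion matrix statistics below, P-Value [Acc > NIR] : 0.2022 is not significant.

# Compute the confusion matrix and evaluation statistics on the test set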

predictions <- predict(rfe$fit, newdata=test_data)
confusionMatrix(predictions, test_data_group)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction DLBCL FL
##      DLBCL    14  2
##      FL        0  2
##                                           
##                Accuracy : 0.8889          
##                  95% CI : (0.6529, 0.9862)
##     No Information Rate : 0.7778          
##     P-Value [Acc > NIR] : 0.2022          
##                                           
##                   Kappa : 0.6087          
##                                           
##  Mcnemar's Test P-Value : 0.4795          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.8750          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.7778          
##          Detection Rate : 0.7778          
##    Detection Prevalence : 0.8889          
##       Balanced Accuracy : 0.7500          
##                                           
##        'Positive' Class : DLBCL           
##
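
The P-Value [Acc > NIR] line can be checked by hand. The sketch below assumes it is a one-sided exact binomial test of the observed accuracy against the No Information Rate (the share of the largest reference class), which is how the caret documentation describes it. The variable names are illustrative, and the counts are copied from the confusion matrix above.

```r
# Counts copied from the test-set confusion matrix above
n_correct <- 14 + 2            # samples on the diagonal (correct predictions)
n_total   <- 14 + 2 + 0 + 2    # all test samples
nir       <- 14 / n_total      # No Information Rate: share of the largest reference class (DLBCL)

accuracy <- n_correct / n_total

# Assumption: one-sided exact binomial test, H1: accuracy > NIR
p_value <- binom.test(n_correct, n_total, p = nir, alternative = "greater")$p.value

# Two-sided exact (Clopper-Pearson) interval for the accuracy
ci_95 <- binom.test(n_correct, n_total)$conf.int

print(round(c(accuracy = accuracy, nir = nir, p_value = p_value), 4))
print(round(ci_95, 4))
```

With only 18 test samples and 14 of them in the majority class, even 16 correct predictions are not enough to separate the model from always predicting DLBCL, so the test is not significant.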

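Re-tuning and rebuilding the model on the RFE-selected feature variables

# Extract the subset of feature variables from the training set
rfe_train_data <- train_data[, caretRfe_variables$Item]
# Candidate mtry values to tune over (generateTestVariableSet is a helper that is not defined in this section)
rfe_mtry <- generateTestVariableSet(length(caretRfe_variables$Item))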
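
generateTestVariableSet is not defined in this section. As a rough illustration only, the hypothetical helper below (sketch_mtry_grid, not the author's function) builds a grid from the 1st, 2nd and higher powers of 1 to 10 and keeps the values below the number of variables. For 216 predictors it happens to give the same mtry values that appear in the tuning results further down, but that agreement is an observation, not proof of how the original helper works.

```r
# Hypothetical stand-in for generateTestVariableSet: an assumption for illustration
sketch_mtry_grid <- function(n_variables) {
  # Highest power needed so that 10^max_power covers n_variables
  max_power <- ceiling(log10(n_variables))
  # (1:10)^1, (1:10)^2, ... pooled into one candidate vector
  candidates <- unlist(lapply(seq_len(max_power), function(p) (1:10)^p))
  # Keep unique values strictly below the number of variables, in ascending order
  sort(unique(candidates[candidates < n_variables]))
}

mtry_grid <- sketch_mtry_grid(216)
print(mtry_grid)
```

A grid like this samples small mtry values densely and large ones sparsely, which keeps the number of models to fit manageable.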

Tuning and model building with caret

library(caret)
# Resampling scheme: 10-fold cross-validation, repeated 5 times
trControl <- trainControl(method="repeatedcv", number=10, repeats=5)

# train model
if(file.exists('rda/rfeVariable_rf_default.rda')){
   rfeVariable_rf_default <- readRDS("rda/rfeVariable_rf_default.rda")
} else {
  # Set the random seed so the results are reproducible
  seed <- 1
  set.seed(seed)
  # Candidate parameter values to search, chosen from experience or intuition
  tuneGrid <- expand.grid(mtry=rfe_mtry)

  rfeVariable_rf_default <- train(x=rfe_train_data, y=train_data_group, method="rf", 
                     tuneGrid = tuneGrid, # mtry values to evaluate
                     metric="Accuracy", #metric='Kappa'
                     trControl=trControl)
  saveRDS(rfeVariable_rf_default, "rda/rfeVariable_rf_default.rda")
}
print(rfeVariable_rf_default)

## Random Forest 
## 
##  59 samples
## 216 predictors
##   2 classes: 'DLBCL', 'FL' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 53, 53, 54, 53, 53, 54, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##     1   0.9802857  0.9459213
##     2   0.9707619  0.9091146
##     3   0.9600952  0.8725321
##     4   0.9554286  0.8405432
##     5   0.9599048  0.8612016
##     6   0.9525714  0.8326301
##     7   0.9572381  0.8642968
##     8   0.9492381  0.8242968
##     9   0.9492381  0.8242968
##    10   0.9492381  0.8242968
##    16   0.9492381  0.8242968
##    25   0.9492381  0.8242968
##    27   0.9463810  0.8160615
##    36   0.9492381  0.8242968
##    49   0.9492381  0.8242968
##    64   0.9425714  0.8042968
##    81   0.9363810  0.7874901
##   100   0.9397143  0.7960615
##   125   0.9311429  0.7713556
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 1.

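The result is still not significant (P-Value [Acc > NIR] > 0.05), and the model is weaker than the one built from the Boruta-selected feature variables. On the test set, the confusion matrix is identical to that of the untuned rfe$fit model.

# Compute the confusion matrix and evaluation statistics on the test set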

predictions <- predict(rfeVariable_rf_default, newdata=test_data)
confusionMatrix(predictions, test_data_group)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction DLBCL FL
##      DLBCL    14  2
##      FL        0  2
##                                           
##                Accuracy : 0.8889          
##                  95% CI : (0.6529, 0.9862)
##     No Information Rate : 0.7778          
##     P-Value [Acc > NIR] : 0.2022          
##                                           
##                   Kappa : 0.6087          
##                                           
##  Mcnemar's Test P-Value : 0.4795          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.8750          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.7778          
##          Detection Rate : 0.7778          
##    Detection Prevalence : 0.8889          
##       Balanced Accuracy : 0.7500          
##                                           
##        'Positive' Class : DLBCL           
##
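
To see where Kappa and Balanced Accuracy in this output come from, the following base R sketch recomputes them from the four counts. It uses the standard definitions (Cohen's kappa, and the mean of sensitivity and specificity), with DLBCL as the positive class as stated in the output. The object names are illustrative.

```r
# Confusion matrix copied from the output above (rows: prediction, columns: reference)
cm <- matrix(c(14, 0, 2, 2), nrow = 2,
             dimnames = list(Prediction = c("DLBCL", "FL"),
                             Reference  = c("DLBCL", "FL")))
n <- sum(cm)

accuracy    <- sum(diag(cm)) / n
sensitivity <- cm["DLBCL", "DLBCL"] / sum(cm[, "DLBCL"])  # true DLBCL predicted as DLBCL
specificity <- cm["FL", "FL"] / sum(cm[, "FL"])           # true FL predicted as FL
balanced_accuracy <- (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement corrected for agreement expected by chance
expected_agreement <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa <- (accuracy - expected_agreement) / (1 - expected_agreement)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity,
              balanced_accuracy = balanced_accuracy, kappa = kappa), 4))
```

The specificity of 0.5 shows the weak point: only 2 of the 4 FL samples in the test set are recognized, which the overall accuracy of 0.8889 hides because DLBCL dominates.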

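Machine learning tutorial series

Starting from random forests, the series works step by step through the concepts and practice of decision trees, random forests, ROC/AUC, datasets, and cross-validation.

What words can explain is explained in words; what a figure can show is shown in a figure; what is hard to describe gets a formula; and where a formula is still unclear, a short piece of code is written, so that each step and concept becomes clear.

It then moves on to production-ready code, model tuning, model comparison, and model evaluation, covering the knowledge and skills needed across the whole machine learning workflow.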
